How complete is the CDC's COVID-19 case surveillance data for race/ethnicity for states and counties?

Status: Draft

Jan 26, 2021

Contact [email here]

Background

Racial and ethnic disparities in the COVID-19 pandemic have exposed longstanding health inequities in the U.S., as multiple data analyses have made clear (CRDT, NYT, APM, KFF, NPR, HHS/ASPE). However, the data landscape for race/ethnicity breakdowns has largely been fragmented and left to non-governmental organizations collecting data from individual state public health websites. The CDC has only published public race/ethnicity data for cases at the U.S. level, not the state or county levels. ASPE, an agency within HHS, wrote in Oct 2020 that "The volunteer-based COVID tracking project has created the most comprehensive centralized resource for race and ethnicity data at the state level."

In July 2020, the New York Times (NYT) published The Fullest Look Yet at the Racial Inequity of Coronavirus, a one-time analysis of data from the CDC obtained via FOIA and legal action that contained county-level case data with race/ethnicity up to May 28, 2020. They did not release the raw data, and they didn't publish any updated analyses. While several non-governmental organizations have taken it upon themselves to gather data for total case counts at the county level (NYT, JHU, USAFacts), none of them have collected race/ethnicity data, which would be a huge undertaking due to the non-uniformity of race/ethnicity categories in state and county public health websites.

In Nov 2020, the CDC made some of the case data that the NYT obtained public: county-level totals in a dashboard and public data about race/ethnicity without state and county details. They also released a restricted access dataset with race/ethnicity, state, and county available upon request. The initial data agreement did not allow county-level analyses to be made public, but an updated data agreement from Dec 14, 2020 allowed such public analyses. In Jan 2021, the Morehouse School of Medicine's Satcher Health Leadership Institute (MSM/SHLI), in collaboration with a partner organization and Google.org, applied for and obtained access to this data.

The CDC Restricted Access data will enable the second public analysis of race/ethnicity disparities across the U.S. at the county level since the NYT analysis in July. However, as with the CDC data that the NYT obtained, the CDC Restricted Access data has several issues with data completeness. On most measures below, the CDC data is more complete than the NYT/CDC data from a little over 6 months earlier.

[Figure: Screen Shot 2021-01-26 at 10.15.52 AM.png, comparing completeness measures for the CDC Restricted Access data and the NYT/CDC data]

Overview

The goal of this analysis is to assess the completeness of the CDC's Restricted Access dataset and its feasibility in examining disparities in race/ethnicity for COVID-19 cases at the county level. We will first assess the completeness of the dataset on its own by looking at which fields are viable for analysis. We will next compare the total case counts in the restricted access dataset to two comparable public datasets at the state and county levels. We will also compare the cases with known race/ethnicity at the state level to the Covid Tracking Project's data.

The top-level data completeness findings are:

  1. Data fields: Most fields in the CDC's Restricted Access data are missing too many values to be useful. The only ones that we used were state, county, age, sex, and race/ethnicity. Race/ethnicity was only known for 55% of cases, as opposed to 97%-100% for the others.
  2. Total case count: For some states and counties, the total case count in the CDC data does not match what comparable datasets report. While the CDC data is expected to lag somewhat, the time lag alone can't explain some of the discrepancies.
  3. Race/ethnicity: Race/ethnicity data availability is highly variable across different states, and this is common to both the CDC and Covid Tracking Project's data. However, there are discrepancies across the two datasets in the availability of race/ethnicity at the state level.

After examining the completeness of the data, we will finally examine race/ethnicity disparities at the county level for data up to Dec 16.

Note that we will not analyze any data about deaths or hospitalizations. While there are fields in this data that indicate if the person died or was hospitalized, they are missing too much data to be reliable. There are also alternate data sources, such as the CDC Provisional Deaths data, which are based on different underlying data that, at first glance, seems to be more complete than the case data we are looking at here.

Completeness Analysis

Data Fields

The restricted access dataset contains 32 fields, which are described on the CDC website. The public version of the restricted access data contains 19 of those fields. The data comes from this case report form, a dense two-page form that appears to have been only partially filled in or entered into the CDC data for many cases. The CDC has extensive FAQs about this surveillance data, one of which is about completeness:

How complete are the data that the CDC receives about COVID-19 cases?

The COVID-19 pandemic has put unprecedented demands on the public health data supply chain. In many states, the large number of COVID-19 cases has severely strained the ability of hospitals, healthcare providers, and laboratories to report cases with complete demographic information, such as race and ethnicity. The unprecedented volume of cases has also limited the ability of state and local health departments to conduct thorough case investigations and collect all requested case data.

As a result, many COVID-19 case notifications submitted to CDC do not have complete information on patient demographics; signs and symptoms of illness; underlying health conditions; characteristics of hospitalizations such as ventilator use; clinical outcomes; exposures; and factors that may put people at higher risk for severe disease. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Because of missing data, analyses of these data elements are likely an underestimate of the true occurrence.

Most states have demographic factors like age and sex for the majority of reported cases. With thousands of cases being reported, however, completeness of these elements is unlikely to improve in the immediate future for some jurisdictions.

Because the racial and ethnic composition of the U.S. population varies by geographic area, comparisons of COVID-19 case information should consider the population of each geographic area. Additionally, because completeness of race and ethnicity information may vary by state or geographic area and other patient factors, such as severity of illness, CDC’s case data may not be generalizable to the entire U.S. population.

CDC data fields

In [ ]:
#@title
import pandas as pd
import altair as alt
from vega_datasets import data  # provides data.us_10m.url for the map cells
%load_ext google.colab.data_table

from google.colab import auth
auth.authenticate_user()

alt.renderers.set_embed_options(actions=False)
In [ ]:
#@title
def FieldAnalysis(project_id, table, field_list):
  # Start every field at 0% for each status (avoids shadowing the built-in `dict`).
  initial_shares = {field: [0.0, 0.0, 0.0, 0.0] for field in field_list}
  unknowns = pd.DataFrame(initial_shares, index=['Unknown', 'Missing', 'NA', 'Known'])
  field_series = []
  value_series = []
  percent_series = []

  for field in field_list:
    field_unknowns_query = ('''
    SELECT
      %s,
      count(*) as cases
    FROM
      %s
    GROUP BY
      %s
    ''')
    query = field_unknowns_query % (field, table, field)
    field_unknowns_df = pd.io.gbq.read_gbq(query, project_id=project_id)
    field_unknowns_df.set_index(field, inplace=True)
    field_unknowns_df.index = field_unknowns_df.index.fillna('Null')

    missing_count = 0
    if 'Missing' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Missing'].cases
    if 'Null' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Null'].cases
    #if field_unknowns_df.index.isnull().any():
    #  missing_count += field_unknowns_df.loc[field_unknowns_df.index.isnull()].cases
    unknowns.loc['Missing', field] = missing_count / field_unknowns_df.cases.sum()

    if 'Unknown' in field_unknowns_df.index:
      unknowns.loc['Unknown', field] = field_unknowns_df.loc['Unknown'].cases / field_unknowns_df.cases.sum()
    if 'NA' in field_unknowns_df.index:
      unknowns.loc['NA', field] = field_unknowns_df.loc['NA'].cases / field_unknowns_df.cases.sum()
    unknowns.loc['Known', field] = 1 - (unknowns.loc['Missing', field] +
                                        unknowns.loc['Unknown', field] +
                                        unknowns.loc['NA', field])
    field_series.extend([field, field, field, field])
    # Display labels, in the same order as percent_series below ('NA' is shown as 'Suppressed').
    value_series.extend(['Known', 'Suppressed', 'Unknown', 'Missing'])
    percent_series.extend([unknowns.loc['Known', field],
                           unknowns.loc['NA', field],
                           unknowns.loc['Unknown', field],
                           unknowns.loc['Missing', field]])
  test = pd.DataFrame.from_dict({'field': field_series,
                               'value': value_series,
                               'percent': percent_series})
  return alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='%')),
      y=alt.Y('field', sort='x'),
      color=alt.Color('value', scale=alt.Scale(scheme='category20')),
      order=alt.Order('field:N'),
      tooltip=[
                  alt.Tooltip('field:N', title='Field'),
                  alt.Tooltip('value:N', title='Value'),
                  alt.Tooltip('percent:Q', format=',.0%', title='Percent'),
      ]
  )

Based on our analysis of the CDC data up to Dec 16, 2020, the only fields that are available for more than 50% of the cases are the date that the case was first reported to the CDC, the status of the case (lab-confirmed or probable), state, county, sex, age, and race/ethnicity, which are shown in the chart below. All other fields, including whether the person died or was hospitalized, are known for fewer than 50% of the cases.

In [ ]:
#@title
field_list = ['cdc_case_earliest_dt', 'current_status', 'res_state', 'res_county', 'sex', 'age_group', 'race_ethnicity_combined']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()

Race/ethnicity is known for only 55% of cases, while the other fields above are known for 97%-99% of cases. The 45% of cases without known race/ethnicity were either marked as "Unknown" on the case report form (35%), missing due to being left blank on the form (4%), or suppressed for privacy reasons for small geographic/demographic population groups (2%).
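
The four-way breakdown above (known, unknown, missing, suppressed) can be illustrated on a small in-memory column, without access to the restricted data. This is a minimal sketch, not the notebook's BigQuery-based method: the toy values and the `completeness_shares` helper are hypothetical, and it assumes that nulls count as "Missing" and that the literal value 'NA' marks suppressed records, as in the field analysis above.

```python
import pandas as pd

# Hypothetical toy column standing in for race_ethnicity_combined.
values = pd.Series([
    'Black, Non-Hispanic', 'Unknown', 'Missing', None, 'NA',
    'White, Non-Hispanic', 'Hispanic/Latino', 'Unknown',
    'Asian, Non-Hispanic', 'White, Non-Hispanic',
])


def completeness_shares(series):
    """Return the share of rows that are Known, Unknown, Missing, or Suppressed."""
    # Assumption: a null value is treated like a field left blank on the form.
    filled = series.fillna('Missing')
    # Keep the three special markers; every other value counts as 'Known'.
    labels = filled.where(filled.isin(['Unknown', 'Missing', 'NA']), 'Known')
    # Assumption: 'NA' means the value was suppressed for privacy.
    labels = labels.replace({'NA': 'Suppressed'})
    shares = labels.value_counts(normalize=True)
    return shares.reindex(['Known', 'Unknown', 'Missing', 'Suppressed'],
                          fill_value=0.0)


shares = completeness_shares(values)
print(shares)
```

Because the four shares are computed from one relabeled column, they always sum to 1, which is a useful sanity check when reading percentage breakdowns like the one above.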

The remaining fields, including whether the person died or was hospitalized, are all known for fewer than 50% of cases.

In [ ]:
#@title
field_list = ['death_yn', 'hosp_yn', 'icu_yn', 'onset_dt', 'pos_spec_dt', 'hc_work_yn',
              'pna_yn', 'abxchest_yn', 'acuterespdistress_yn', 'mechvent_yn', 'fever_yn', 'sfever_yn', 'chills_yn', 'myalgia_yn', 'runnose_yn',
              'sthroat_yn', 'cough_yn', 'sob_yn', 'nauseavomit_yn', 'headache_yn', 'abdom_yn', 'diarrhea_yn', 'medcond_yn']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()

Partner Org/CDC data fields

The case report form contains many more fields, but unfortunately, the data gets less complete as you go down the form. A partner organization obtained a version of this data that contains 101 fields with data up to Aug 25, 2020. Several of the additional fields from that dataset are shown below; the field with the most known data is whether the case was associated with an outbreak, but even that is only known for 30% of cases.

In [ ]:
#@title
field_list = ['death_week', 'icu_length', 'hosp_length', 'translator_yn', 'housing', 'exp_work_critical', 'outbreak_associated',
              'rigors_yn', 'taste_yn', 'fatigue_yn', 'wheezing_yn', 'diffbreathing_yn', 'chestpain_yn', 'test_pcr', 'test_serologic',
              'exp_adultfacility', 'exp_airport', 'exp_animal', 'exp_community', 'exp_gathering', 'exp_contact', 'exp_correctional',
              'exp_ship', 'exp_house', 'exp_other', 'exp_school', 'exp_othcountry', 'exp_unk', 'exp_work']
project_id = 'msm-internal-data'
table = '`msm-internal-data.crew.covid_case_surveillance`'
FieldAnalysis(project_id, table, field_list).display()

Total Case Count

The first step in evaluating the completeness of the CDC's Restricted Access Dataset is to check the total case counts at the U.S., state, and county levels. We have chosen to compare against the Covid Tracking Project's Racial Data Tracker (CRDT) and the NYT's public data, which are updated on a regular basis (CRDT twice a week, NYT daily) and come from state and local public health websites or agencies. CRDT is the only source for case data with race/ethnicity breakdowns, but there are several sources for county-level total data in addition to the NYT, such as JHU and USAFacts. (This paper analyzes the differences at the state level up to July for cases and deaths.)

The table below compares geographic vs. race/ethnicity availability for these three different data sources:

  • CDC: CDC Case Surveillance Restricted Access Data
  • CRDT: Covid Racial Data Tracker Public Data
  • NYT: New York Times COVID-19 Public Data

[Table image: Screen Shot 2021-01-26 at 1.26.55 AM.png, geographic vs. race/ethnicity availability for the CDC, CRDT, and NYT data]

Because the CDC data is the only dataset that has race/ethnicity at the county level, the most similar datasets for purposes of comparison are (1) the CRDT at the state level with race/ethnicity, and (2) the NYT data at the county level with no race/ethnicity.

We will compare all datasets up to Dec 16, 2020, which is the latest reporting date in the CDC data. We expect to see slight variations in the total case counts due to a possible lag of a few weeks in states or counties reporting to the CDC, but we don't expect to see huge discrepancies due to that lag.

Baseline: NYT vs. CRDT

To get a baseline of how much we can expect the CDC total case counts to match the CRDT or NYT, we can see how closely the CRDT and NYT match each other. The black line shows where the case counts are equal; most states fall close to that line (hover over the dots to see states).

In [ ]:
#@title
CASES = 'Cases'
HOSPITALIZATIONS = 'Hospitalizations'
DEATHS = 'Deaths'
HCW_CASES = 'Healthcare Worker Cases'

DATASET = 'cdc'
#DATASET = 'crew'

metric = CASES
#metric = HOSPITALIZATIONS
#metric = DEATHS
#metric = HCW_CASES

project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
date = 'Dec 16'

if DATASET == 'crew':
  project_id = 'msm-internal-data'
  table = '`msm-internal-data.crew.covid_case_surveillance`'
  date = 'Aug 11 (CREW)'

race_ethnicity_groups = ['black', 'hispanic', 'aian', 'nhpi', 'asian', 'white', 'other']
#race_ethnicity_groups = ['black', 'white'] # for hc_work_yn coverage
In [ ]:
#@title
# State/territory abbreviation -> FIPS code. Territories use their current codes (AS 60, GU 66, MP 69, PR 72, VI 78).
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}

crdt_query = ('''
SELECT
  State as state,
  Cases_Total as crdt_cases,
  #Deaths_Total as crdt_deaths,
  Cases_Total - Cases_Unknown as crdt_known_cases,
  #Deaths_Total - Deaths_Unknown as crdt_known_deaths,
  ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_cases_percent,
  #ROUND(1 - Deaths_Unknown / Deaths_Total, 4) as crdt_known_deaths_percent,  
FROM `msm-secure-data-1b.ndunlap_secure.crdt`
WHERE
  date = 20201216
''')
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df.set_index('state', inplace=True)
In [ ]:
#@title
nyt_states_query = ('''
SELECT
  state_name,
  state_fips_code,
  confirmed_cases as nyt_cases,
  deaths as nyt_deaths
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE
  date = DATE(2020, 12, 16) AND
  state_fips_code IS NOT NULL
''')
nyt_states_df = pd.io.gbq.read_gbq(nyt_states_query, project_id=project_id)
nyt_states_df.state_fips_code.unique()
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Puerto Rico']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Guam']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Virgin Islands']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Northern Mariana Islands']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'American Samoa']
nyt_states_df['state_fips_code'] = nyt_states_df.state_fips_code.astype(int)
nyt_states_df.set_index('state_fips_code', inplace=True)
In [ ]:
#@title
crdt_df.reset_index(inplace=True)
crdt_df['state_fips_code'] = crdt_df.state
crdt_df = crdt_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_df.set_index('state_fips_code', inplace=True)
nyt_crdt_merged_df = nyt_states_df.join(crdt_df, on="state_fips_code", how='inner', lsuffix='_left', rsuffix='_right')
In [ ]:
#@title
nyt_crdt_merged_df['percent'] = round(nyt_crdt_merged_df.nyt_cases / nyt_crdt_merged_df.crdt_cases, 2)
nyt_crdt_merged_df
nyt_crdt_merged_df.reset_index(inplace=True)
nyt_crdt_merged_df.percent.describe()
In [ ]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of NYT to CRDT'),
]

plot = alt.Chart(nyt_crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_cases:Q', axis=alt.Axis(title='CRDT cases'),
        scale=alt.Scale(domain=(0, 2000000))
    ),
    alt.Y('nyt_cases:Q', axis=alt.Axis(title='NYT cases'),
        scale=alt.Scale(domain=(0, 2000000))
    ),
    color=alt.Color('percent',
                    scale=alt.Scale(scheme='blueorange',
                                    reverse=True,
                                               domain=[0, 2],
                                               clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio of NYT to CRDT'),
    tooltip=tooltips,
).properties(
    width=350,
    height=350,
)

line = pd.DataFrame({
    'x': [0, 2000000],
    'y': [0, 2000000],
})

line_plot = alt.Chart(line).mark_line(color='black').encode(
    x='x',
    y='y',
).properties(
    width=350,
    height=350,
)

(plot + line_plot).configure_mark(
    stroke='grey'
).properties(
    width=350,
    height=350,
).display()

The ratio of NYT to CRDT cases is within the range 0.97-1.11 for all states:

  • Average = 1.01
  • Median = 1.00
  • Min = 0.97 (Tennessee)
  • Max = 1.11 (Georgia)
  • Percent between 0.85 and 1.15 = 100%
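
The summary statistics above can be reproduced from any series of per-state ratios. The sketch below uses made-up ratios with placeholder state labels (not values from the NYT or CRDT data), and the `ratio_summary` helper is hypothetical. It assumes the 0.85-1.15 band is inclusive at both ends, which the text does not specify.

```python
import pandas as pd

# Hypothetical ratios for illustration only; not values from either dataset.
ratios = pd.Series({'A': 0.97, 'B': 1.00, 'C': 1.02, 'D': 1.11, 'E': 0.80})


def ratio_summary(ratios, low=0.85, high=1.15):
    """Summarize a series of ratios indexed by state label."""
    # Assumption: the band includes both endpoints.
    within = ratios.between(low, high)
    return {
        'mean': round(ratios.mean(), 2),
        'median': round(ratios.median(), 2),
        'min': (ratios.idxmin(), ratios.min()),
        'max': (ratios.idxmax(), ratios.max()),
        # The mean of a boolean series is the share of True values.
        'share_within_band': within.mean(),
    }


summary = ratio_summary(ratios)
print(summary)
```

Reporting the state label alongside the minimum and maximum, as `idxmin` and `idxmax` do here, makes it easier to check that each bullet names the state that actually produced the extreme value.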

We can also view these ratios on a map (hover over states for details).

In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of NYT to CRDT'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(nyt_crdt_merged_df, 'state_fips_code', ['percent', 'state', 'nyt_cases', 'crdt_cases'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#F1F1F1',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of NYT cases to CRDT cases as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

States: CDC vs. CRDT

In [ ]:
#@title
# State/territory abbreviation -> FIPS code. Territories use their current codes (AS 60, GU 66, MP 69, PR 72, VI 78).
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}
compare_cases_query = ('''
SELECT
  res_state,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state
''')
# Unused: for CDC vs. NYT states case totals comparison.
#states_df = pd.io.gbq.read_gbq(compare_cases_query % table, project_id=project_id)
#states_df = states_df.replace(to_replace={'res_state': states_to_fips})
#states_df = states_df[states_df.res_state != 'Unknown']
#states_df = states_df[states_df.res_state != 'NA']
#states_df = states_df[states_df.res_state != 'OCONUS']
#states_df.rename(columns={'res_state': 'state_fips_code'}, inplace=True)
#states_df['state_fips_code'] = states_df.state_fips_code.astype(int)
#states_df.set_index('state_fips_code', inplace=True)
In [ ]:
#@title
states_df = pd.io.gbq.read_gbq(compare_cases_query % table, project_id=project_id)
states_df.rename(columns={'res_state': 'state'}, inplace=True)
states_df.set_index('state', inplace=True)

crdt_query = ('''
SELECT
  State as state,
  Cases_Total as crdt_cases,
  #Deaths_Total as crdt_deaths,
  Cases_Total - Cases_Unknown as crdt_known_cases,
  #Deaths_Total - Deaths_Unknown as crdt_known_deaths,
  ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_cases_percent,
  #ROUND(1 - Deaths_Unknown / Deaths_Total, 4) as crdt_known_deaths_percent,  
FROM `msm-secure-data-1b.ndunlap_secure.crdt`
WHERE
  date = 20201216
''')
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df = crdt_df[crdt_df.state != 'PR']
crdt_df = crdt_df[crdt_df.state != 'GU']
crdt_df = crdt_df[crdt_df.state != 'VI']
crdt_df = crdt_df[crdt_df.state != 'MP']
crdt_df = crdt_df[crdt_df.state != 'AS']
crdt_df.set_index('state', inplace=True)
crdt_merged_df = states_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_cases, 4)
#crdt_merged_df['percent'] = round((crdt_merged_df.cdc_cases - crdt_merged_df.crdt_cases) / crdt_merged_df.crdt_cases, 4)
#crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_deaths, 4)
#crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_known_cases, 4)
#crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_known_deaths, 4)
crdt_merged_df.percent.describe()
In [ ]:
#@title
crdt_merged_df.percent.sort_values()
In [ ]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_cases:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 2000000)),
        axis=alt.Axis(title='CRDT cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('cdc_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 2000000)),
        axis=alt.Axis(title='CDC cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'
                    ),
 tooltip=tooltips,
).properties(
    width=350,
    height=350
)

line = pd.DataFrame({
    'x': [0, 2000000],
    'y': [0, 2000000],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color='black').encode(
    x='x',
    y='y',
).properties(
    # Match the scatter plot's size; layered charts should agree on width/height.
    width=350,
    height=350
)

scatter = (plot + line_plot).properties(
    title='Ratio of CDC to CRDT cases by state as of Dec 16'
    ).configure_mark(stroke='grey')
scatter.display()

We can see that the CDC case counts differ from the CRDT case counts much more drastically than the NYT case counts did. The ratio of CDC to CRDT cases is within the range 0.03-1.64 for all states:

  • Average = 0.79
  • Median = 0.97
  • Min = 0.03 (Tennessee)
  • Max = 1.64 (Georgia)
  • Percent between 0.85 and 1.15 = 37%

Here are the ratios shown on a map:

In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')
tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT')
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_cases', 'crdt_cases'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#F1F1F1',
      stroke='white'
).project('albersUsa')

crdt_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of CDC Cases to CRDT cases as of Dec 16'
)

crdt_layered_map.configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

The 19 states (37% of total) that were within +/- 0.15 of the CRDT data could plausibly be off due to time lags in reporting cases to the CDC vs. reporting them on state public health websites. However, there are many outlier states that are too far off from the CRDT case counts to be explained by a time lag:

  • 14 states: < 0.5 ratio of CDC to CRDT cases
  • 1 state > 1.5 ratio of CDC to CRDT cases (Alaska)
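
One way to count states in each band of the ratio is to bucket the ratios with `pd.cut`. This is a sketch on made-up ratios with placeholder labels, not the actual CDC/CRDT values. The bin edges follow the thresholds used above (0.5, 0.85, 1.15, 1.5), and treating each bin as closed on the left is an assumption, since the text does not say how exact boundary values were handled.

```python
import pandas as pd

# Hypothetical CDC-to-CRDT ratios for illustration only.
ratios = pd.Series({'A': 0.03, 'B': 0.45, 'C': 0.90, 'D': 1.10, 'E': 1.64})

# Thresholds from the text; float('inf') leaves the top bucket open-ended.
bins = [0, 0.5, 0.85, 1.15, 1.5, float('inf')]
labels = ['< 0.5', '0.5-0.85', '0.85-1.15', '1.15-1.5', '>= 1.5']

# right=False makes each bin include its left edge and exclude its right edge.
buckets = pd.cut(ratios, bins=bins, labels=labels, right=False)

# sort=False keeps the buckets in threshold order and keeps empty buckets.
counts = buckets.value_counts(sort=False)
print(counts)
```

Keeping the empty buckets in the output makes it clear when no state falls in an intermediate band, rather than silently dropping that row.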
In [ ]:
#@title
crdt_merged_df.percent.sort_values()

Counties: CDC vs. NYT

In [ ]:
#@title
# CDC vs. NYT county

cases_query = ('''
SELECT
  res_state,
  res_county,
  race_ethnicity_combined,
  COUNT(*) as cases
FROM
  %s
GROUP BY
   res_county,
   res_state,
   race_ethnicity_combined
''')
df = pd.io.gbq.read_gbq(cases_query % table, project_id=project_id)
df = df[df.res_state != 'PR']
df = df[df.res_state != 'GU']
df = df[df.res_state != 'VI']
df = df[df.res_state != 'MP']
df = df[df.res_state != 'AS']
In [ ]:
#@title
# CDC vs. NYT county

project_id = 'msm-internal-data'

df_county_fips_map = pd.io.gbq.read_gbq(f'''
SELECT
*
FROM
#  `msm-internal-data.ipums_acs.acs_2019_5year_county`
  `msm-internal-data.crew.county_fips_mapping`
''', project_id=project_id)

df_county_fips_map.crew_county = df_county_fips_map.crew_county.str.lower()
df_county_fips_map['state_county'] = df_county_fips_map.state + '-' + df_county_fips_map.crew_county
df_county_fips_map['state_county'] = df_county_fips_map.state_county.astype('string').str.strip()
df_county_fips_map.set_index('state_county', inplace=True)
In [ ]:
#@title
# Concatenate the state and county names because county names are not unique across states.
df.res_county = df.res_county.str.lower()
df['state_county'] = df.res_state + '-' + df.res_county
df['state_county'] = df.state_county.astype('string').str.strip()
df.set_index('state_county', inplace=True)
df['race_ethnicity_combined'] = df.race_ethnicity_combined.astype('string').str.strip()
df = df.replace(to_replace={'race_ethnicity_combined': {
    'Asian, Non-Hispanic': 'asian_cases',
    'Black, Non-Hispanic': 'black_cases',
    'White, Non-Hispanic': 'white_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'aian_cases',
    'Hispanic/Latino': 'hispanic_cases',
    'Multiple/Other, Non-Hispanic': 'other_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'nhpi_cases',
    'Missing': 'unknown_cases',
    'Unknown': 'unknown_cases',
    'NA': 'na_cases',
    #'Yes': 'black_cases',  # for hc_work_yn
    #'No': 'white_cases',  # for hc_work_yn
  }}
)
In [ ]:
#@title
merged_df = df.join(df_county_fips_map, on="state_county", how='inner', lsuffix='_left', rsuffix='_right')
#na_df = merged_df[merged_df['res_county'] == 'na']
#print(sum(na_df.cases))
#no_na_df = merged_df[merged_df['res_county'] != 'na']
#no_na_df = no_na_df[no_na_df['res_county'] != 'other']
#no_na_df = no_na_df[no_na_df['res_county'] != 'unknown']
#mismatch_df = no_na_df[no_na_df['county_fips'].isnull()]
#unique = pd.DataFrame(mismatch_df.index.unique())
#sum(mismatch_df.cases)
In [ ]:
#@title
# Create a crosstab table with rows = counties, columns = race_ethnicity_combined.
crosstab_df = pd.crosstab(merged_df['county_fips'], merged_df.race_ethnicity_combined, values=merged_df.cases, aggfunc=sum,
                          margins=True,
                          margins_name='total_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['county_fips'] = crosstab_df.county_fips.astype(int)
# Known cases exclude both unknown/missing ('unknown_cases') and suppressed ('na_cases') race/ethnicity.
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.na_cases.fillna(0) - crosstab_df.unknown_cases.fillna(0)
In [ ]:
#@title
df_acs_name_lookup = pd.io.gbq.read_gbq(f'''
SELECT
  *
FROM
  `msm-internal-data.ipums_acs.acs_2019_5year_county`
''', project_id=project_id)

df_acs_name_lookup['state_county'] = df_acs_name_lookup.county.astype('string').str.strip() + ', ' + df_acs_name_lookup.state.astype('string').str.strip()
df_acs_name_lookup.drop(columns=['state', 'county'], inplace=True)
df_acs_name_lookup.set_index('county_fips', inplace=True)

county_chart_df = crosstab_df.join(df_acs_name_lookup, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
county_chart_df.county_fips = county_chart_df.county_fips.astype(int)
In [ ]:
#@title
print(len(county_chart_df))
print(county_chart_df.total_pop.sum())
print(county_chart_df.total_pop.sum() / 324697795)  # Share of the total population living in these counties
print(0.55 * 324697795)  # Population covered by the NYT analysis (55% of the total)
In [ ]:
#@title
nyt_counties_query = ('''
SELECT
  county_fips_code,
  confirmed_cases as nyt_cases,
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE
  date = DATE(2020, 12, 16) AND
  county_fips_code IS NOT NULL
''')
nyt_counties_df = pd.io.gbq.read_gbq(nyt_counties_query, project_id=project_id)
nyt_counties_df.rename(columns={'county_fips_code': 'county_fips'}, inplace=True)
nyt_counties_df.county_fips.unique()
nyt_counties_df['county_fips'] = nyt_counties_df.county_fips.astype(int)
nyt_counties_df.set_index('county_fips', inplace=True)
In [ ]:
#@title
county_chart_df.set_index('county_fips', inplace=True)
nyt_merged_df = county_chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_merged_df = nyt_merged_df.reset_index()
#nyt_merged_df.county_fips = nyt_merged_df.county_fips.astype(int)
nyt_merged_df['percent'] = round(nyt_merged_df.total_cases / nyt_merged_df.nyt_cases, 2)
#nyt_merged_df.reset_index(inplace=True)
In [ ]:
#@title
tooltips = [alt.Tooltip('state_county:N', title='County'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('total_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to NYT'),
]

plot = alt.Chart(nyt_merged_df).mark_circle(size=60).encode(
    alt.X('nyt_cases:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 100000), clamp=True),
        axis=alt.Axis(title='NYT cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('total_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 100000), clamp=True),
        axis=alt.Axis(title='CDC cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'),
 tooltip=tooltips,
).properties(
     width=350,
    height=350,
)

line = pd.DataFrame({
    'x': [0, 100000],
    'y': [0, 100000],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
     width=350,
    height=350,
)

scatter_clamp = (plot + line_plot).properties(title='Zoom in on counties with up to 100,000 cases')
In [ ]:
#@title
tooltips = [alt.Tooltip('state_county:N', title='County'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('total_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='CDC to NYT ratio'),
]

plot = alt.Chart(nyt_merged_df).mark_circle(size=60).encode(
    alt.X('nyt_cases:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 600000)),
        axis=alt.Axis(title='NYT cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('total_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 600000)),
        axis=alt.Axis(title='CDC cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'),
 tooltip=tooltips,
)

line = pd.DataFrame({
    'x': [0, 600000],
    'y': [0, 600000],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
     width=350,
    height=350,
)

scatter_all = (plot + line_plot).properties(
    title='Ratio of CDC to NYT cases by county as of Dec 16'
).properties(
     width=350,
    height=350,
)

(scatter_all | scatter_clamp
 ).configure_mark(stroke='grey'
 ).display()
In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'county_fips'], empty='none')
tooltips = [alt.Tooltip('state_county:N', title='County'),
            alt.Tooltip('total_cases:Q', format=',', title='CDC cases'),
            alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to NYT')
]

#plot = alt.Chart(us_states).mark_geoshape(
plot = alt.Chart(us_counties).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(nyt_merged_df, 'county_fips', ['percent', 'state_county', 'total_cases', 'nyt_cases'])
      #from_=alt.LookupData(nyt_merged_df, 'county_fips', ['percent'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=900,
      height=650,
      title='Ratio of CDC Cases to NYT cases as of Dec 16'
)

layered_map.configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()
In [ ]:
#@title
nyt_merged_df.percent.describe()

Race/ethnicity

In [ ]:
#@title
compare_cases_unknowns_query = ('''
SELECT
  res_state,
  race_ethnicity_combined,
  COUNT(*) as cdc_cases
FROM
  %s
#WHERE
#  death_yn = 'Yes'
GROUP BY
   res_state,
   race_ethnicity_combined
''')
states_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query % table, project_id=project_id)
#states_df = states_df.replace(to_replace={'res_state': states_to_fips})
states_df = states_df[states_df.res_state != 'Unknown']
states_df = states_df[states_df.res_state != 'NA']
states_df = states_df[states_df.res_state != 'OCONUS']
#states_df = states_df.reset_index()

states_df['race_ethnicity_combined'] = states_df.race_ethnicity_combined.astype('string').str.strip()
states_df = states_df.replace(to_replace={'race_ethnicity_combined': {
    'Asian, Non-Hispanic': 'cdc_known_cases',
    'Black, Non-Hispanic': 'cdc_known_cases',
    'White, Non-Hispanic': 'cdc_known_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'cdc_known_cases',
    'Hispanic/Latino': 'cdc_known_cases',
    'Multiple/Other, Non-Hispanic': 'cdc_known_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'cdc_known_cases',
    'Missing': 'cdc_unknown_cases',
    'Unknown': 'cdc_unknown_cases',
    'NA': 'cdc_na_cases',
    }})
states_df.rename(columns={'res_state': 'state'}, inplace=True)
#states_df['state_fips_code'] = states_df.state_fips_code.astype(int)
#states_df.set_index('state_fips_code', inplace=True)
In [ ]:
#@title
crosstab_df = pd.crosstab(states_df['state'], states_df.race_ethnicity_combined, values=states_df.cdc_cases, aggfunc=sum,
                          margins=True,
                          margins_name='cdc_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
#crosstab_df['state_fips_code'] = crosstab_df.state_fips_code.astype(int)
crosstab_df['cdc_known_or_na_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df['cdc_known_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_na_cases.fillna(0) - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df

crdt_merged_df = crosstab_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['cdc_known_cases_percent'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['cdc_known_or_na_cases_percent'] = round(crdt_merged_df.cdc_known_or_na_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_known_cases_percent / crdt_merged_df.crdt_known_cases_percent, 4)
crdt_merged_df['percent_with_na'] = round(crdt_merged_df.cdc_known_or_na_cases_percent / crdt_merged_df.crdt_known_cases_percent, 4)
crdt_merged_df['percent_counts'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.crdt_known_cases, 4)
crdt_merged_df['one'] = round(crdt_merged_df.crdt_cases / crdt_merged_df.crdt_cases, 4)
crdt_merged_df_no_ny = crdt_merged_df[crdt_merged_df.state != 'NY']

States: CDC vs. CRDT

In [ ]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('cdc_known_cases:Q', format=',', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases:Q', format=',', title='CRDT known cases'),
              alt.Tooltip('percent_counts:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_known_cases:Q',
    #alt.X('crdt_known_cases:Q',
        #scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CRDT known race/ethnicity cases'),
        scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('cdc_known_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        #scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CDC known race/ethnicity cases'),
        scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent_counts', scale=alt.Scale(scheme='blueorange',
                                                      reverse=True,
                                                      domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'
                    ),
 tooltip=tooltips,
)

line = pd.DataFrame({
    #'x': [0, 1],
    #'y': [0, 1],
    'x': [0, 1200000],
    'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
    height=350,
    width=350
)

scatter = (plot + line_plot).properties(
    title='Ratio of CDC to CRDT cases with known race/ethnicity by state as of Dec 16'
    ).configure_mark(stroke='grey'
)
scatter.display()

Excluding New York, which has 0 known cases in CRDT, the ratio of CDC to CRDT cases with known race/ethnicity is within the range 0.03-1.64 for all states. In summary:

  • Average = 0.63
  • Median = 0.77
  • Min = 0.01 (North Dakota, Louisiana, Wyoming)
  • Max = 1.18 (Massachusetts)
  • Percent between 0.85 and 1.15 = 34%
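
Summary figures like the ones above could be computed along the following lines. This is a minimal sketch, not the notebook's actual code: the `summarize_ratio` helper and the toy values are hypothetical, and it assumes a ratio column such as `percent_counts` with New York already excluded.

```python
import pandas as pd

def summarize_ratio(ratio, low=0.85, high=1.15):
    """Summarize a Series of CDC-to-CRDT ratios (hypothetical helper)."""
    ratio = ratio.dropna()
    return {
        'mean': ratio.mean(),
        'median': ratio.median(),
        'min': ratio.min(),
        'max': ratio.max(),
        # Share of states whose ratio falls within [low, high], inclusive.
        'share_in_band': ratio.between(low, high).mean(),
    }

# Toy values for illustration only; these are not the real state ratios.
toy_ratios = pd.Series([0.1, 0.5, 0.9, 1.0, 1.2], index=['A', 'B', 'C', 'D', 'E'])
summary = summarize_ratio(toy_ratios)
print(summary)
```

In the notebook, the same helper could be applied to `crdt_merged_df_no_ny.percent_counts`.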

Only 4 states had more cases with known race/ethnicity in the CDC data than in the CRDT data, whereas 23 states had more total cases in the CDC data than the CRDT data. We can again view this ratio on a map:

In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_known_cases:Q', format=',', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases:Q', format=',', title='CRDT known cases'),
              alt.Tooltip('percent_counts:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent_counts', 'state', 'cdc_known_cases', 'crdt_known_cases'])
  ).encode(
      alt.Color('percent_counts',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of CDC to CRDT cases with known race/ethnicity as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
)
layered_map.display()

The differences between CDC and CRDT cases with known race/ethnicity were more extreme than for total cases:

  • 9 states: < 0.25 ratio of CDC to CRDT cases with known race/ethnicity
  • 18 states: < 0.50 ratio of CDC to CRDT cases with known race/ethnicity
  • 1 state: > 1.50 ratio of CDC to CRDT cases with known race/ethnicity (Alaska)
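
Counts like these can be tallied directly from the ratio column. The sketch below is illustrative: `count_by_threshold` and the toy ratios are hypothetical, and it assumes the "< 0.50" count is cumulative, so it also includes the states below 0.25.

```python
import pandas as pd

def count_by_threshold(ratio):
    """Count states below 0.25, below 0.50, and above 1.50 (hypothetical helper)."""
    ratio = ratio.dropna()
    return {
        'below_0.25': int((ratio < 0.25).sum()),
        'below_0.50': int((ratio < 0.50).sum()),  # includes the states below 0.25
        'above_1.50': int((ratio > 1.50).sum()),
    }

# Toy values for illustration only; these are not the real state ratios.
toy_ratios = pd.Series({'A': 0.10, 'B': 0.30, 'C': 0.80, 'D': 1.60})
counts = count_by_threshold(toy_ratios)
print(counts)
```
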
In [ ]:
#@title
crdt_merged_df_no_ny.percent_counts.sort_values()

What accounts for the differences between the CDC and CRDT in the number of cases with known race/ethnicity? Two factors contribute:

  1. Total case counts, which we examined in the Total Cases section
  2. The percent of total cases with known race/ethnicity

The map and chart above combine both factors. We can look at the second factor, the percentage of total cases with known race/ethnicity, separately below.
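
The two factors multiply: the ratio of known-race/ethnicity counts (`percent_counts` in this notebook) equals the ratio of total cases times the ratio of the percent known (`percent`). The sketch below checks that identity on made-up numbers; the function name and the inputs are hypothetical.

```python
def decompose_known_ratio(cdc_cases, cdc_known, crdt_cases, crdt_known):
    """Split the CDC-to-CRDT known-case ratio into its two factors."""
    total_ratio = cdc_cases / crdt_cases          # factor 1: total case counts
    cdc_known_percent = cdc_known / cdc_cases
    crdt_known_percent = crdt_known / crdt_cases
    percent_ratio = cdc_known_percent / crdt_known_percent  # factor 2: percent known
    known_ratio = cdc_known / crdt_known          # the combined ratio shown on the map
    return total_ratio, percent_ratio, known_ratio

# Made-up state: the CDC has 80% of the CRDT total, and half of those have known race/ethnicity.
total_ratio, percent_ratio, known_ratio = decompose_known_ratio(
    cdc_cases=800, cdc_known=400, crdt_cases=1000, crdt_known=900)
print(total_ratio, percent_ratio, known_ratio)
```

A low combined ratio can therefore come from missing cases, from missing race/ethnicity on the cases that are present, or from both.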

In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('cdc_known_cases:Q', format=',', title='Known race/ethnicity cases'),
              alt.Tooltip('cdc_known_cases_percent:Q', format='.1%', title='Percent known cases'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_cases', 'cdc_known_cases', 'cdc_known_cases_percent'])
  ).encode(
      alt.Color('cdc_known_cases_percent',  
                type='quantitative', 
                legend=alt.Legend(format='.0%'),
                scale=alt.Scale(scheme='redyellowblue',
                                domain=[0, 1],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='black',
      stroke='white'
).project('albersUsa')

cdc_known_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=450,
      height=350,
      title='Percent of CDC cases with known race/ethnicity as of Dec 16'
)
In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('crdt_known_cases:Q', format=',', title='Known race/ethnicity cases'),
              alt.Tooltip('crdt_known_cases_percent:Q', format='.1%', title='Percent known cases'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'crdt_cases', 'crdt_known_cases', 'crdt_known_cases_percent'])
  ).encode(
      alt.Color('crdt_known_cases_percent',  
                type='quantitative', 
                legend=alt.Legend(format='.0%'),
                scale=alt.Scale(scheme='redyellowblue',
                                domain=[0, 1],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='black',
      stroke='white'
).project('albersUsa')

crdt_known_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=450,
      height=350,
      title='Percent of CRDT cases with known race/ethnicity as of Dec 16'
)
(cdc_known_layered_map | crdt_known_layered_map).configure_legend(
      orient='top',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()
In [ ]:
#@title
crdt_merged_df
In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('cdc_known_or_na_cases:Q', format=',', title='Known or suppressed race/ethnicity cases'),
              alt.Tooltip('cdc_known_or_na_cases_percent:Q', format='.1%', title='Percent known or suppressed cases'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_cases', 'cdc_known_cases', 'cdc_known_or_na_cases_percent', 'cdc_known_or_na_cases'])
  ).encode(
      alt.Color('cdc_known_or_na_cases_percent',  
                type='quantitative', 
                legend=alt.Legend(format='.0%'),
                scale=alt.Scale(scheme='redyellowblue',
                                domain=[0, 1],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='black',
      stroke='white'
).project('albersUsa')

known_or_na_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Percent of CDC cases with known or suppressed race/ethnicity as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
)
#known_or_na_layered_map
In [ ]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('cdc_known_cases_percent:Q', format='.0%', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases_percent:Q', format='.0%', title='CRDT known cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_known_cases_percent:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CRDT percent of cases with known race/ethnicity', format='.0%')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('cdc_known_cases_percent:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CDC percent of cases with known race/ethnicity', format='.0%')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'
                    ),
 tooltip=tooltips,
)

line = pd.DataFrame({
    'x': [0, 1],
    'y': [0, 1],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
    height=350,
    width=350
)

scatter = (plot + line_plot).properties(
    title='Ratio of CDC to CRDT percent of cases with known race/ethnicity by state as of Dec 16'
    )

scatter.configure_mark(stroke='grey').display()
In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_known_cases_percent:Q', format='.0%', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases_percent:Q', format='.0%', title='CRDT known cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_known_cases_percent', 'crdt_known_cases_percent'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of CDC to CRDT percent of cases with known race/ethnicity as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

When calculating disparities in COVID-19 cases between race/ethnicity groups, we will need to be cautious about drawing conclusions from data in states where race/ethnicity is known for only a small percentage of cases and/or the overall case totals are incomplete. For example, in California only 21% of cases have race and ethnicity, and 0% (check this) of the cases are Hispanic/Latino.
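
One way to act on this caution is to flag states before computing disparities. The sketch below is an assumption-laden illustration: `flag_unreliable`, the `total_ratio` column, and the `MIN_TOTAL_RATIO` cutoff are hypothetical, while `MIN_PERCENT_KNOWN = 0.5` mirrors the value used in this notebook's county filter.

```python
import pandas as pd

MIN_PERCENT_KNOWN = 0.5  # same value as the county-level filter in this notebook
MIN_TOTAL_RATIO = 0.85   # hypothetical cutoff for total-case completeness

def flag_unreliable(df):
    """Add boolean flags for states whose data may be too incomplete (hypothetical helper)."""
    out = df.copy()
    out['low_known'] = out['cdc_known_cases_percent'] < MIN_PERCENT_KNOWN
    out['low_total'] = out['total_ratio'] < MIN_TOTAL_RATIO
    out['use_with_caution'] = out['low_known'] | out['low_total']
    return out

# Toy rows for illustration only; these are not real state values.
toy = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'cdc_known_cases_percent': [0.21, 0.80, 0.90],
    'total_ratio': [1.00, 0.40, 0.95],
})
flagged = flag_unreliable(toy)
print(flagged[['state', 'use_with_caution']])
```
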

Counties: CDC

We don't have a point of comparison for the known race/ethnicity percentage at the county level, but we can look at the breakdown by county to see the variation between counties within the same state.
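
Alongside the maps, the within-state variation could be summarized in a table. This is a rough sketch under stated assumptions: the helper name is hypothetical, and it assumes `state_county` has the form "County, State" and that `percent_known_cases` holds the share of cases with known race/ethnicity.

```python
import pandas as pd

def known_percent_spread_by_state(df):
    """Summarize the spread of county-level percent known within each state."""
    # Assumes the state is the text after the last ', ' in state_county.
    state = df['state_county'].str.rsplit(pat=', ', n=1).str[-1].rename('state')
    spread = df.groupby(state)['percent_known_cases'].agg(['min', 'median', 'max', 'count'])
    spread['range'] = spread['max'] - spread['min']
    return spread.sort_values('range', ascending=False)

# Toy counties for illustration only.
toy = pd.DataFrame({
    'state_county': ['Alpha County, AA', 'Beta County, AA',
                     'Gamma County, BB', 'Delta County, BB'],
    'percent_known_cases': [0.2, 0.9, 0.5, 0.6],
})
spread = known_percent_spread_by_state(toy)
print(spread)
```

States at the top of the resulting table have the widest gap between their most and least complete counties.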

In [ ]:
#@title
def GenerateColNames(group):
  cases_col = group + '_cases'
  pop_col = group + '_pop'
  pop_percent_col = group + '_percent'
  cases_percent_col = group + '_cases_percent'
  cases_percent_with_unknown_col = group + '_cases_percent_with_unknown'
  cases_per_100_col = group + '_cases_per_100'
  cases_to_pop_col= group + '_cases_to_pop'
  cases_to_pop_with_unknown_col= group + '_cases_to_pop_with_unknown'
  return {'cases': cases_col,
          'pop': pop_col,
          'pop_percent': pop_percent_col,
          'cases_per_100': cases_per_100_col,
          'cases_percent': cases_percent_col,
          'cases_percent_with_unknown': cases_percent_with_unknown_col,
          'cases_to_pop': cases_to_pop_col,
          'cases_to_pop_with_unknown': cases_to_pop_with_unknown_col,
  }

group_names = {}
for group in race_ethnicity_groups:
  group_names[group] = GenerateColNames(group)

chart_df = county_chart_df.copy(deep=True)
chart_df.reset_index(inplace=True)
chart_df.county_fips = chart_df.county_fips.astype(int)
for group in race_ethnicity_groups:
  chart_df[group_names[group]['cases_per_100']] = round(chart_df[group_names[group]['cases']] / chart_df[group_names[group]['pop']], 4)
  chart_df[group_names[group]['cases_percent']] = round(chart_df[group_names[group]['cases']] / chart_df.total_known_cases, 2)
  chart_df[group_names[group]['cases_percent_with_unknown']] = round(chart_df[group_names[group]['cases']] / chart_df.total_cases, 2)
  chart_df[group_names[group]['cases_to_pop']] = round(
      chart_df[group_names[group]['cases_percent']] / chart_df[group_names[group]['pop_percent']], 2)
  chart_df[group_names[group]['cases_to_pop_with_unknown']] = round(
      chart_df[group_names[group]['cases_percent_with_unknown']] / chart_df[group_names[group]['pop_percent']], 2)
chart_df['percent_known_cases'] = round(chart_df.total_known_cases / chart_df.total_cases, 2)
chart_df['percent_known_or_na_cases'] = round((chart_df.total_known_cases + chart_df.na_cases.fillna(0)) / chart_df.total_cases, 2)
In [ ]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

filter_data = False
#MIN_POP_PERCENT = 0.001
MIN_POP = 100
MIN_PERCENT_KNOWN = 0.5
MIN_CASES = 5

group_to_display_name = {
    'black': 'Black',
    'white': 'White',
    'hispanic': 'Hispanic/Latino',
    'asian': 'Asian',
    'nhpi': 'Native Hawaiian/Pacific Islander',
    'aian': 'American Indian/Alaska Native',
    'other': 'Other or multiple race/ethnicity',
    'total': 'Total'
}

group_to_short_name = {
    'black': 'Black',
    'white': 'White',
    'hispanic': 'Hispanic',
    'asian': 'Asian',
    'nhpi': 'NHPI',
    'aian': 'AIAN',
    'other': 'Other',
    'total': 'Total'
}

chart_col_to_color_scheme = {
    'cases_per_100': 'yelloworangebrown',
    'cases_to_pop': 'blueorange',
    'percent_known_cases': 'redyellowblue',
    'percent_known_or_na_cases': 'redyellowblue',
}
chart_col_to_legend_format = {
    'cases_per_100': '.0%',
    'cases_to_pop': '.1f',
    'percent_known_cases': '.0%',
    'percent_known_or_na_cases': '.0%',
}

def GenerateCountyMap(chart_df, chart_col, group, group_names, metric, big_or_small, date):
  group_chart_col = chart_col
  if group:
    group_chart_col = group_names[group][chart_col]
  group_display_name = ''
  if group:
    group_display_name = group_to_short_name[group]
    group_short_name = group_to_short_name[group]

  width = 900
  height = 650
  if big_or_small == 'small':
    width /= 2
    height /= 2
  
  chart_col_to_range = {
    'cases_per_100': [0, .2],
    'cases_to_pop': [0, 2],
    'percent_known_cases': [0, 1],
    'percent_known_or_na_cases': [0, 1],
  }    

  prevalence_text = 'who had COVID-19'

  col_to_title = {
      'total_cases': group_display_name + ' ' + metric + ' as of ' + date,
      'cases_per_100': 'Percent of ' + group_display_name + ' Population ' + prevalence_text + ' as of ' + date,
      'cases_to_pop': 'Ratio of ' + group_display_name + ' ' + metric + ' Share to Population Share'  + ' as of ' + date,
      'percent_known_cases': 'Percent of ' + metric + ' with Known Race/Ethnicity' + ' as of ' + date,
      'percent_known_or_na_cases': 'Percent of ' + metric + ' with Known or Suppressed Race/Ethnicity' + ' as of ' + date,
  }

  filtered_chart_df = chart_df
  if group and filter_data:
    #filtered_chart_df = filtered_chart_df[filtered_chart_df[group_names[group]['pop_percent']] > MIN_POP_PERCENT]
    filtered_chart_df = filtered_chart_df[filtered_chart_df[group_names[group]['pop']] > MIN_POP]
    filtered_chart_df = filtered_chart_df[filtered_chart_df['percent_known_cases'] > MIN_PERCENT_KNOWN]
    filtered_chart_df = filtered_chart_df[filtered_chart_df[group_names[group]['cases']] > MIN_CASES]

  highlight = alt.selection_single(on='mouseover', fields=['id', 'county_fips'], empty='none')

  data_cols = ['state_county',
               'percent_known_cases',
               'percent_known_or_na_cases',
               'total_cases']
  if group:
    data_cols.extend([
                      group_names[group]['cases'],
                      group_names[group]['pop'],
                      group_names[group]['pop_percent'],
                      group_names[group]['cases_per_100'],
                      group_names[group]['cases_percent'],
                      group_names[group]['cases_percent_with_unknown'],
                      group_names[group]['cases_to_pop'],
                      group_names[group]['cases_to_pop_with_unknown'],
                      ])

  tooltips = [alt.Tooltip('state_county:N', title='County'),
              alt.Tooltip('percent_known_cases:Q', format='.0%', title=metric + ' with race/ethnicity')
  ]
  if chart_col in ('percent_known_cases', 'percent_known_or_na_cases'):
    tooltips.extend([
        alt.Tooltip('total_cases:Q', format=',.0f', title=metric)
    ])
  if chart_col == 'percent_known_or_na_cases':
    tooltips.extend([
        alt.Tooltip('percent_known_or_na_cases:Q', format='.0%',
                    title=metric + ' with known or suppressed race/ethnicity')
    ])
  if group:
    tooltips.extend([
                alt.Tooltip(group_names[group]['cases'] + ':Q', format=',',
                            title=group_short_name + ' ' + metric.lower()),
    ])
    if chart_col == 'cases_per_100':
      tooltips.extend([
                  alt.Tooltip(group_names[group]['pop'] + ':Q', format=',',
                            title=group_short_name + ' population'),
                  alt.Tooltip(group_names[group]['cases_per_100'] + ':Q', format='.2%',
                              title='Percent ' + prevalence_text)
      ])
    elif chart_col == 'cases_to_pop':
      tooltips.extend([
                  alt.Tooltip(group_names[group]['cases_percent_with_unknown'] + ':Q', format='.1%',
                              title='Percent of total ' + metric.lower()),
                  alt.Tooltip(group_names[group]['cases_percent'] + ':Q', format='.1%',
                              title='Percent of known race/ethnicity ' + metric.lower()),
                  alt.Tooltip(group_names[group]['pop_percent'] + ':Q', format='.1%',
                              title=group_short_name + ' percent of population'),
                  alt.Tooltip(group_names[group]['cases_to_pop'] + ':Q', format='.2f',
                              title='Ratio of percent of known ' + metric.lower() + ' to percent of population'),
                  #alt.Tooltip(group_names[group]['cases_to_pop_with_unknown'] + ':Q', format='.2f',
                  #            title='Ratio of ' + metric.lower() + ' to population including unknowns'),
      ])
  # Reverse the color scheme only for the ratio of case share to population share.
  reverse_scale = chart_col == 'cases_to_pop'

  plot = alt.Chart(us_counties).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(filtered_chart_df, 'county_fips', data_cols)
  ).encode(
      alt.Color(group_chart_col,  
                type='quantitative', 
                legend=alt.Legend(format=chart_col_to_legend_format[chart_col]),
                scale=alt.Scale(scheme=chart_col_to_color_scheme[chart_col],
                                reverse=reverse_scale,
                                domain=chart_col_to_range[chart_col],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

  states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
  )

  states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      #fill='#E0E0E0',
      #fill='lightgrey',
      stroke='white'
  ).project('albersUsa')

  layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=width,
      height=height,
      title=col_to_title[chart_col],
  )
  return layered_map
In [ ]:
#@title
big_charts = {'cases_per_100': {}, 'cases_to_pop': {}}
small_charts = {'cases_per_100': {}, 'cases_to_pop': {}}

known_percent = GenerateCountyMap(
    chart_df, 'percent_known_cases', None, group_names, metric, 'big', date)
percent_known_or_na_cases = GenerateCountyMap(
    chart_df, 'percent_known_or_na_cases', None, group_names, metric, 'big', date)

for group in race_ethnicity_groups:
  for value in ('cases_per_100', 'cases_to_pop'):
    big_charts[value][group] = GenerateCountyMap(
        chart_df, value, group, group_names, metric, 'big', date)
    small_charts[value][group] = GenerateCountyMap(
        chart_df, value, group, group_names, metric, 'small', date)
In [ ]:
#@title
known_percent.configure_legend(
      orient='top-right',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).display()

Disparities Analysis

U.S.: Race/ethnicity and age

All charts showing case counts are limited to lab-confirmed or probable cases and therefore undercount actual infections; we use 'Cases' as shorthand for 'Confirmed (lab-confirmed or probable) cases'. Furthermore, the CDC data up to Dec 16, 2020 contains only about 80% of the case counts in the CRDT, so cases are further undercounted here. This is due either to a lag in reporting cases to the CDC compared with state public health websites, or to more systematic errors that affect some states and population groups more than others.

As a point of comparison, the CRDT case counts up to Dec 16 show 5.12% of the U.S. population having had COVID-19, whereas the CDC data below shows only 4.09%. We should therefore expect the numbers below to capture at most about 80% of the actual case counts. We don't attempt to correct for this, given that the missing data is unlikely to be spread uniformly across all race/ethnicity and age groups.
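
The roughly 80% figure follows from the two percentages quoted above. A minimal check, using only those two numbers (the variable names are ours, not from the notebook):

```python
# Share of the U.S. population with a recorded case as of Dec 16, 2020,
# taken from the two figures quoted in the text above.
crdt_share = 0.0512  # COVID Tracking Project (CRDT)
cdc_share = 0.0409   # CDC case surveillance data

# Fraction of CRDT-reported cases that also appear in the CDC data.
coverage = cdc_share / crdt_share
print(f"CDC data covers about {coverage:.0%} of CRDT cases")
```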

In [ ]:
#@title

test = pd.DataFrame.from_dict({'group': [
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-',
                                         ],
                               'percent': [.0234, .0254, .0211, .0129, .0376, .0409]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='.1%'), title=''),
      y=alt.Y('group', sort='-x', title=''),
      color=alt.Color('group', 
                      scale=alt.Scale(scheme='category20'),
                      title=''),
      order=alt.Order('percent:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Race/ethnicity group'),
                  alt.Tooltip('percent:Q', format='.2%', title='Cases in race/ethnicity group'),
      ]
).properties(
    title='Percent of race/ethnicity group population who had COVID-19 as of Dec 16'
).display()

The -Total- bar is larger than all the others because it also includes the 45% of cases in the data that didn't have known race/ethnicity.
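
As a rough illustration of that point, the snippet below splits the overall rate into the part from cases with known race/ethnicity and the part from cases without it. It assumes the 45% figure applies to the same set of cases behind the 4.09% total; the variable names are illustrative.

```python
total_rate = 0.0409      # overall bar in the chart above
unknown_fraction = 0.45  # share of cases without known race/ethnicity

# Split the overall rate by whether race/ethnicity is known.
known_rate = total_rate * (1 - unknown_fraction)
unknown_rate = total_rate * unknown_fraction
print(f"known: {known_rate:.2%}, unknown: {unknown_rate:.2%}")
```

Under that assumption, the known-race/ethnicity portion falls within the range of the individual group rates, which is consistent with the gap being driven mostly by the unknown cases.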

We can also look at the percent of each age and race/ethnicity group who had COVID-19.

In [ ]:
#@title
# Age x race numbers come from a spreadsheet with manual calculations using BQ and the ACS 1-year estimates.
# Asian and NHPI are combined because the IPUMS data used to calculate the age categories (not available in ACS API)
# split out Asian subgroups, including "Other Asian or Pacific Islander," and so I combined them into one category.
# https://usa.ipums.org/usa-action/variables/RACE#codes_section

race_list = ['Black'] * 9
race_list.extend(['Hispanic/Latino'] * 9)
race_list.extend(['White'] * 9)
race_list.extend(['Asian/NHPI'] * 9)
race_list.extend(['AIAN'] * 9)
race_list.extend(['-Total-'] * 9)
test = pd.DataFrame.from_dict({'group': race_list,
                               'age': ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+'] * 6,
                               'percent': [
                                           0.0072,0.0136,0.0275,0.0297,0.0284,0.0285,0.0252,0.0252,0.0325,  # Black
                                           0.0086,0.0172,0.0331,0.0328,0.0346,0.0328,0.0261,0.022,0.0247,  # Hispanic/Latino
                                           0.0063,0.0188,0.0302,0.0234,0.0236,0.0227,0.0184,0.0181,0.0276,  # White
                                           0.0168,0.022,0.0282,0.0234,0.0229,0.027,0.0273,0.032,0.0508,  # Asian/NHPI
                                           0.0169,0.0315,0.0507,0.053,0.047,0.0387,0.0307,0.0251,0.0268,  # AIAN
                                           0.0134,0.0319,0.0573,0.0497,0.0494,0.0463,0.0356,0.0316,0.0453,  # All
                               ]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', sort='y', axis=alt.Axis(format='.0%'), title=''),
      y=alt.Y('age', title='Age'),
      column=alt.Column('group',
                        title='Percent of age and race/ethnicity group population who had COVID-19 as of Dec 16',
                        header=alt.Header(titleFontSize=13)),
      color=alt.Color('group', scale=alt.Scale(scheme='category20'), title='Race/Ethnicity',
                      ),
      order=alt.Order('group:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Race/Ethnicity group'),
                  alt.Tooltip('age:N', title='Age'),
                  alt.Tooltip('percent:Q', format='.1%', title='Cases in age group'),
      ]
).properties(
  width=110, 
).display()

We can see above that, for the population as a whole, people age 20-29 have the highest rate of recorded COVID-19 cases of any age group (5.73%). The peak age band varies by race/ethnicity group, however: it is 20-29 for White, 30-39 for AIAN, 40-49 for Hispanic/Latino, and 80+ for Black and Asian/NHPI. Because different race/ethnicity groups have different age compositions, splitting the cases by both age and race/ethnicity lets us compare race/ethnicity groups against each other without differences in age composition obscuring the comparison.
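
One way to check which age band has the highest rate in each group is to scan the chart data directly. The sketch below copies the rates from the cell above into a plain dictionary; `rates` and `peak_age` are names introduced here for illustration.

```python
ages = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+']

# Age-specific rates copied from the chart data above.
rates = {
    'Black': [0.0072, 0.0136, 0.0275, 0.0297, 0.0284, 0.0285, 0.0252, 0.0252, 0.0325],
    'Hispanic/Latino': [0.0086, 0.0172, 0.0331, 0.0328, 0.0346, 0.0328, 0.0261, 0.022, 0.0247],
    'White': [0.0063, 0.0188, 0.0302, 0.0234, 0.0236, 0.0227, 0.0184, 0.0181, 0.0276],
    'Asian/NHPI': [0.0168, 0.022, 0.0282, 0.0234, 0.0229, 0.027, 0.0273, 0.032, 0.0508],
    'AIAN': [0.0169, 0.0315, 0.0507, 0.053, 0.047, 0.0387, 0.0307, 0.0251, 0.0268],
    '-Total-': [0.0134, 0.0319, 0.0573, 0.0497, 0.0494, 0.0463, 0.0356, 0.0316, 0.0453],
}

# For each group, find the age band with the highest rate.
peak_age = {group: ages[vals.index(max(vals))] for group, vals in rates.items()}
for group, age in peak_age.items():
    print(f"{group}: {age}")
```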

You can compare this to the similar chart from the NYT, "Coronavirus cases per 10,000 people, by age and race," in their July 2020 article.

We can also look at the age-adjusted numbers, which weight each group's age-specific rates by a standard age composition. This allows us to compare the rates of COVID-19 across race/ethnicity groups with differences in age composition removed as a factor. We can see below that, unlike for COVID-19 death rates, the crude and age-adjusted numbers are fairly similar within each race/ethnicity group except for Asian/NHPI, where the age-adjusted rate is 1.2 percentage points higher.
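
Age adjustment of this kind is commonly done by direct standardization: each age-specific rate is multiplied by that age band's share of a standard population, and the products are summed. The sketch below assumes that reading. The notebook does not state which standard population was used, so the weights here are hypothetical placeholders and the result is not expected to reproduce the figures in the chart below.

```python
def age_adjusted_rate(age_rates, standard_weights):
    """Directly standardized rate: weighted sum of age-specific rates."""
    if len(age_rates) != len(standard_weights):
        raise ValueError("age_rates and standard_weights must have the same length")
    total_weight = sum(standard_weights)
    # Normalize the weights so they sum to 1 before applying them.
    return sum(rate * weight / total_weight
               for rate, weight in zip(age_rates, standard_weights))

# Age-specific rates for one group (White), copied from the age chart above.
white_rates = [0.0063, 0.0188, 0.0302, 0.0234, 0.0236, 0.0227, 0.0184, 0.0181, 0.0276]

# HYPOTHETICAL standard population shares for the bands 0-9 through 80+.
hypothetical_weights = [0.12, 0.13, 0.14, 0.13, 0.12, 0.13, 0.12, 0.07, 0.04]

adjusted = age_adjusted_rate(white_rates, hypothetical_weights)
print(f"Age-adjusted rate under the hypothetical weights: {adjusted:.2%}")
```

Because every group is weighted by the same standard age composition, differences that remain between the adjusted rates cannot be attributed to one group being younger or older than another.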

In [ ]:
#@title
test = pd.DataFrame.from_dict({'group': [
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-',
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-'
                                         ],
                               'measure': ['Crude', 'Crude', 'Crude', 'Crude', 'Crude', 'Crude',
                                         'Age Adjusted', 'Age Adjusted', 'Age Adjusted', 'Age Adjusted', 'Age Adjusted', 'Age Adjusted'],
                               'percent': [.0234, .0254, .0211, .0129, .0376, .0409,
                                           .0231, .0260, .0206, .0249, .0378, .0403]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='.1%'), title=''),
      y=alt.Y('measure', sort='x', title=''),
      row=alt.Row('group', title='Race/ethnicity'),
      color=alt.Color('measure', title='', scale=alt.Scale(scheme='category20')),
      order=alt.Order('group:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Race/ethnicity group'),
                  alt.Tooltip('measure:N', title='Measure'),
                  alt.Tooltip('percent:Q', format='.2%', title='Cases in race/ethnicity group'),
      ]
).properties(
    title='Percent of race/ethnicity group population who had COVID-19 as of Dec 16'
).display()

Counties: Race/ethnicity

All counties are shown below; hover over a county to see the percent of cases with race/ethnicity, the case counts, and the population sizes.
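
The second set of maps below plots, for each group, the ratio of the group's share of cases with known race/ethnicity to its share of the county population. A minimal sketch of that calculation follows; the function name, argument names, and county numbers are made up for illustration and are not taken from the notebook.

```python
def cases_to_pop_ratio(group_cases, known_cases, group_pop, total_pop):
    """Ratio of a group's share of known-race/ethnicity cases to its population share."""
    case_share = group_cases / known_cases
    pop_share = group_pop / total_pop
    return case_share / pop_share

# Hypothetical county: the group has 300 of 1,000 cases with known
# race/ethnicity and 20,000 of 100,000 residents.
ratio = cases_to_pop_ratio(group_cases=300, known_cases=1000,
                           group_pop=20_000, total_pop=100_000)
print(f"{ratio:.2f}")
```

A ratio above 1 means the group accounts for a larger share of cases than of the population; a ratio below 1 means the opposite.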

In [ ]:
#@title
((small_charts['cases_per_100']['black'] | small_charts['cases_per_100']['hispanic']) &
 (small_charts['cases_per_100']['white'] | small_charts['cases_per_100']['asian']) &
 (small_charts['cases_per_100']['aian'] | small_charts['cases_per_100']['nhpi'])).configure_legend(
      orient='top',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).display()
In [ ]:
#@title
((small_charts['cases_to_pop']['black'] | small_charts['cases_to_pop']['hispanic']) &
 (small_charts['cases_to_pop']['white'] | small_charts['cases_to_pop']['asian']) &
 (small_charts['cases_to_pop']['aian'] | small_charts['cases_to_pop']['nhpi'])).configure_legend(
      orient='top',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).display()